This is part two of an analysis to build a model that can predict whether students are consuming dangerous amounts of alcohol. If you have already viewed part one of this analysis, you will notice many similarities. The primary difference is that there are only two risk groups in this analysis (Low and High), compared to the three risk groups (Low, Medium, High) in part one. The goal of this analysis is to see whether changing the parameters of the problem can lead to a more predictive model. The model in part one struggled to identify students in the higher risk groups, partly because of the small amount of data that we are working with and how unbalanced it is. By simplifying the problem, the model will, ideally, become more accurate, and therefore more useful.

## 'data.frame':    649 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ G1        : int  0 9 12 14 11 12 13 10 15 12 ...
##  $ G2        : int  11 11 13 14 13 12 12 13 16 12 ...
##  $ G3        : int  11 11 12 14 13 13 13 13 17 13 ...

It’s a rather small dataset that we are working with, and I’m worried that the data does not reflect the average student well. We are only looking at students who have taken a Portuguese class, an elective course, so there might be a ‘type’ of student that takes this class. I expect that we would have a more accurate sense of the average student if we had data from a mandatory class, such as math or English. Nonetheless, I hope we will have some interesting and useful findings.

Univariate Analysis

## [1] "Percent of students in each drinking level:"
## 
##          1          2          3          4          5 
## 0.69491525 0.18644068 0.06625578 0.02619414 0.02619414

The majority of the students consume very little, if any, alcohol during the week, but about 5% of students drink significant amounts (values >= 4).

## [1] "Percent of students in each drinking level:"
## 
##          1          2          3          4          5 
## 0.38058552 0.23112481 0.18489985 0.13405239 0.06933744

Clearly, students are drinking much more alcohol on weekends than on weekdays. The percentage of significant drinkers (values >= 4) jumped from about 5% to just over 20%. The share of students drinking little to no alcohol (value = 1) fell by nearly half (from 69% to 38%).
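The shares of significant drinkers quoted above can be checked directly from the two tables. The sketch below uses only the proportions already printed; the variable names are illustrative and not taken from the original code.

```r
# Drinking levels and the proportions printed in the two tables above
level   <- 1:5
weekday <- c(0.69491525, 0.18644068, 0.06625578, 0.02619414, 0.02619414)
weekend <- c(0.38058552, 0.23112481, 0.18489985, 0.13405239, 0.06933744)

# Share of students at level 4 or 5 ("significant" drinkers)
heavy_weekday <- sum(weekday[level >= 4])
heavy_weekend <- sum(weekend[level >= 4])

print(round(c(weekday = heavy_weekday, weekend = heavy_weekend), 3))
```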

## [1] "Correlation between weekday and weekend:"
## [1] 0.6165614

Of the 649 students, 241 (37.1%) drink little to no alcohol. Of the 451 students who drink little to nothing during the week, 210 (46.6%) consume some alcohol on weekends. It is very rarely the case that students drink more during the week than on weekends.

## [1] "Percent of students in each group:"
## 
##           2           3           4           5           6           7 
## 0.371340524 0.178736518 0.152542373 0.112480740 0.077041602 0.049306626 
##           8           9          10 
## 0.026194145 0.009244992 0.023112481
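The student counts quoted earlier (241, 451 and 210) can be recovered from the printed proportions. This is a small arithmetic check that uses only figures reported above; the variable names are illustrative.

```r
n_students <- 649

# Share of students at weekday level 1 (from the weekday table)
p_weekday_low <- 0.69491525
# Share of students whose combined weekday + weekend value is 2,
# meaning level 1 on both scales (from the group table above)
p_both_low <- 0.371340524

n_weekday_low  <- round(n_students * p_weekday_low)  # level 1 during the week
n_both_low     <- round(n_students * p_both_low)     # level 1 on both scales
n_weekend_only <- n_weekday_low - n_both_low         # level 1 in the week, higher on weekends

# Share of weekday level-1 students who drink more on weekends
share_weekend_only <- n_weekend_only / n_weekday_low
print(c(n_weekday_low, n_both_low, n_weekend_only))
print(round(share_weekend_only, 3))
```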

For the remainder of this analysis, we are going to divide the students into two groups. A student is ‘High Risk’ if their total alcohol consumption value is > 5, or if either their ‘Weekday’ or ‘Weekend’ value is >= 4. All other students (total values <= 5) are ‘Low Risk’.
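The grouping rule can be written in a few lines of R. This is a minimal sketch on made-up rows, assuming the weekday and weekend values live in the `Dalc` and `Walc` columns shown in the data summary; the `Total` and `Risk` column names are my own choice and may differ from the original code.

```r
# Toy rows using the dataset's column names:
# Dalc = weekday consumption, Walc = weekend consumption (both on a 1-5 scale)
toy <- data.frame(
  Dalc = c(1, 1, 2, 1, 4, 3),
  Walc = c(1, 3, 3, 4, 1, 3)
)

# Combined consumption ranges from 2 to 10
toy$Total <- toy$Dalc + toy$Walc

# High risk: combined value above 5, or either single value at 4 or more
is_high <- toy$Total > 5 | toy$Dalc >= 4 | toy$Walc >= 4
toy$Risk <- factor(ifelse(is_high, "High", "Low"), levels = c("Low", "High"))

print(toy)
```

Note that the fourth and fifth toy rows have a total of only 5 but are still labelled High, because one of the two individual values reaches 4.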

## [1] "Percent of students in each risk group:"
## 
##      Low     High 
## 0.770416 0.229584

Although it is good that the majority of the students are in the low risk group, it will be very important to build a model that can accurately predict which students are in the high risk group, so that they can receive the guidance they need to stop this detrimental behaviour.

Bivariate Analysis

## [1] "Number of students attending each school"
## 
##      Gabriel Pereira Mousinho da Silveira 
##                  423                  226
## 
##      Gabriel Pereira Mousinho da Silveira 
##             0.651772             0.348228

Although there are more students attending Gabriel Pereira, the relative number of students in each risk group is about the same.

## [1] "Number of students of each gender:"
## 
## Female   Male 
##    383    266
## 
##    Female      Male 
## 0.5901387 0.4098613

Despite 59% of the students being female, 68% of those in the high risk group are males.

## [1] "Number of students of each age:"
## 
##  15  16  17  18  19  20  21  22 
## 112 177 179 140  32   6   2   1
## 
##          15          16          17          18          19          20 
## 0.172573190 0.272727273 0.275808937 0.215716487 0.049306626 0.009244992 
##          21          22 
## 0.003081664 0.001540832
## [1] "Correlation between age and risk"
## [1] 0.1207322

Although there are few older students to support this conclusion, it seems that students drink more as they age. The relationship is weak, though, with a correlation of only about 0.12.
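A correlation between a numeric feature and a two-level risk group requires the risk group to be turned into numbers first. The sketch below shows one common way to do that, assuming Risk is a factor with levels Low and High; the ages are made up and the original code may have coded risk differently.

```r
# Made-up ages and risk labels for illustration only
age  <- c(15, 16, 17, 18, 19, 16, 17, 18)
risk <- factor(c("Low", "Low", "Low", "High", "High", "Low", "High", "Low"),
               levels = c("Low", "High"))

# as.numeric() on a factor returns the level codes: Low -> 1, High -> 2
risk_code <- as.numeric(risk)

# Pearson correlation between a numeric feature and a two-level code
# (this is the point-biserial correlation)
r <- cor(age, risk_code)
print(r)
```

Because correlation is unaffected by shifting a variable, coding the groups as 1/2 or as 0/1 gives the same value.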

## [1] "Number of students of each type of address:"
## 
## Rural Urban 
##   197   452
## 
##     Rural     Urban 
## 0.3035439 0.6964561

Many more students live in urban areas, but this does not tell us anything about the risk of alcohol abuse.

## [1] "Number of students of each family size:"
## 
## Greater than 3    Less than 3 
##            457            192
## 
## Greater than 3    Less than 3 
##      0.7041602      0.2958398

Students from smaller families are slightly more likely to be at risk of alcohol abuse.

## [1] "Number of students of each parental marriage status group:"
## 
##    Apart Together 
##       80      569
## 
##     Apart  Together 
## 0.1232666 0.8767334

I was expecting to see an effect here, but there doesn’t appear to be one.

## [1] "Number of students for each level of education (Mother):"
## 
##                None           4th Grade    5th to 9th Grade 
##                   6                 143                 186 
## Secondary Education    Higher Education 
##                 139                 175
## 
##                None           4th Grade    5th to 9th Grade 
##         0.009244992         0.220338983         0.286594761 
## Secondary Education    Higher Education 
##         0.214175655         0.269645609

Oddly, the relationship between risk and mother’s education level seems to alternate with each level of education. 4th Grade and Secondary Education: high risk. 5th to 9th Grade and Higher Education: low risk.

## [1] "Number of students for each level of education (Father):"
## 
##                None           4th Grade    5th to 9th Grade 
##                   7                 174                 209 
## Secondary Education    Higher Education 
##                 131                 128
## 
##                None           4th Grade    5th to 9th Grade 
##          0.01078582          0.26810478          0.32203390 
## Secondary Education    Higher Education 
##          0.20184900          0.19722650

Differing from the mother’s education level, there doesn’t appear to be any relationship between a father’s education level and their child’s drinking habits.

## [1] "Correlation between mothers' and fathers' education levels:"
## [1] 0.6474766

There is a reasonably strong correlation between a mother’s and a father’s education level. This helps to explain the similarities in the plots we just saw.

## [1] "Number of students for each type of job (Mother):"
## 
##  At_Home   Health    Other Services  Teacher 
##      135       48      258      136       72
## 
##    At_Home     Health      Other   Services    Teacher 
## 0.20801233 0.07395994 0.39753467 0.20955316 0.11093991

There doesn’t appear to be any strong relationship here.

## [1] "Number of students for each type of job (Father):"
## 
##  At_Home   Health    Other Services  Teacher 
##       42       23      367      181       36
## 
##    At_Home     Health      Other   Services    Teacher 
## 0.06471495 0.03543914 0.56548536 0.27889060 0.05546995

This looks more significant. If a father works in services, it seems that their child is more likely to abuse alcohol.

## [1] "Number of students for each type of reason:"
## 
## Course Perference     Close to Home             Other School Reputation 
##               285               149                72               143
## 
## Course Perference     Close to Home             Other School Reputation 
##         0.4391371         0.2295840         0.1109399         0.2203390

If the school was chosen based on its reputation, the student appears less likely to abuse alcohol.

## [1] "Number of students for each type of guardian:"
## 
## Father Mother  Other 
##    153    455     41
## 
##     Father     Mother      Other 
## 0.23574730 0.70107858 0.06317411

It is interesting to see that so many students chose their mother as their primary guardian, yet so few students have separated parents. If the guardian is ‘Other,’ the student is more likely to be in the high risk group.

## [1] "Number of students for each travel time group:"
## 
## Less than 15     15 to 30     30 to 60 More than 60 
##          366          213           54           16
## 
## Less than 15     15 to 30     30 to 60 More than 60 
##   0.56394453   0.32819723   0.08320493   0.02465331
## df$Risk: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.532   2.000   4.000 
## -------------------------------------------------------- 
## df$Risk: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.691   2.000   4.000
## [1] "Correlation between travel time and risk"
## [1] 0.08954301

We can see from the mean values that as travel time increases, students are slightly more likely to abuse alcohol. However, the relationship is not strong, as shown by the identical medians and 3rd quartiles and by the low correlation.

## [1] "Number of students for each study time group:"
## 
##  Less than 2 Hours       2 to 5 Hours      5 to 10 Hours 
##                212                305                 97 
## More than 10 Hours 
##                 35
## 
##  Less than 2 Hours       2 to 5 Hours      5 to 10 Hours 
##         0.32665639         0.46995378         0.14946071 
## More than 10 Hours 
##         0.05392912
## df$Risk: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.022   2.000   4.000 
## -------------------------------------------------------- 
## df$Risk: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.624   2.000   4.000
## [1] "Correlation between time spent studying and risk:"
## [1] -0.2018618

There is a reasonable relationship here. If a student spends less time studying, s/he is more likely to abuse alcohol. Note: The values for ‘Time Spent Studying’ (1,2,3,4) represent ‘Less than 2 Hours’, ‘2 to 5 Hours’, ‘5 to 10 Hours’, and ‘More than 10 Hours’.
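The mapping between the numeric codes and the labels in the note above can be expressed with a labelled factor. This is a small sketch on made-up values; `study_labels` and `studytime_f` are illustrative names rather than names from the original code.

```r
# Labels for the codes 1-4, as listed in the note above
study_labels <- c("Less than 2 Hours", "2 to 5 Hours",
                  "5 to 10 Hours", "More than 10 Hours")

# Made-up study time codes for six students
studytime <- c(2, 2, 1, 3, 4, 1)

# Attach the labels while keeping the original ordering of the codes
studytime_f <- factor(studytime, levels = 1:4, labels = study_labels)

counts <- table(studytime_f)
print(counts)
print(prop.table(counts))

# The underlying integer codes are still available for summaries and correlations
print(as.integer(studytime_f))
```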

## [1] "Number of students for each failure group:"
## 
##   0   1   2   3 
## 549  70  16  14
## 
##          0          1          2          3 
## 0.84591680 0.10785824 0.02465331 0.02157165
## df$Risk: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.184   0.000   3.000 
## -------------------------------------------------------- 
## df$Risk: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.349   0.000   3.000
## [1] "Correlation between number of failed classes and risk:"
## [1] 0.1170598

As a student fails more classes, it seems that they are more likely to abuse alcohol. To simplify the relationship, let’s look at students having failed at least one class versus their risk group.

## [1] "Number of students for each failed group:"
## 
##  No Yes 
## 549 100
## 
##        No       Yes 
## 0.8459168 0.1540832
## df$Risk: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.132   1.000   2.000 
## -------------------------------------------------------- 
## df$Risk: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.228   1.000   2.000

This should help to make the relationship look clearer. If a student has failed at least one class, they are slightly more likely to abuse alcohol.
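The summaries for this yes/no feature range from 1 to 2, which suggests that No and Yes were coded as 1 and 2. Under that assumption, a group's mean minus 1 is the share of that group who answered Yes (for example, a mean of 1.228 would correspond to 22.8%). The sketch below illustrates this on made-up rows; the `Failed` column name is hypothetical.

```r
# Made-up failure counts and risk labels for illustration only
toy <- data.frame(
  failures = c(0, 0, 1, 3, 0, 2, 0, 0),
  Risk = factor(c("Low", "Low", "High", "High", "Low", "Low", "High", "Low"),
                levels = c("Low", "High"))
)

# Collapse the number of failed classes into a yes/no flag
toy$Failed <- factor(ifelse(toy$failures > 0, "Yes", "No"), levels = c("No", "Yes"))
print(table(toy$Failed))

# as.numeric() maps No -> 1 and Yes -> 2, so the summary ranges from 1 to 2
print(by(as.numeric(toy$Failed), toy$Risk, summary))

# The same information expressed directly as the share who failed, per risk group
share_failed <- tapply(toy$Failed == "Yes", toy$Risk, mean)
mean_code    <- tapply(as.numeric(toy$Failed), toy$Risk, mean)
print(share_failed)
```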

## [1] "Number of students for each educational support group:"
## 
##  No Yes 
## 581  68
## 
##        No       Yes 
## 0.8952234 0.1047766
## df$Risk: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.114   1.000   2.000 
## -------------------------------------------------------- 
## df$Risk: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.074   1.000   2.000

No strong relationship here.

## [1] "Number of students for each educational support group:"
## 
##  No Yes 
## 251 398
## 
##        No       Yes 
## 0.3867488 0.6132512
## df$Risk: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.642   2.000   2.000 
## -------------------------------------------------------- 
## df$Risk: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.517   2.000   2.000

Students who did not receive educational support from their family are more likely to abuse alcohol.

## [1] "Number of students for each paying group:"
## 
##  No Yes 
## 610  39
## 
##         No        Yes 
## 0.93990755 0.06009245

Paying for extra classes doesn’t really change a student’s drinking habits.

## [1] "Number of students for each activity group:"
## 
##  No Yes 
## 334 315
## 
##        No       Yes 
## 0.5146379 0.4853621

Students in the high risk group are more likely to participate in extra-curricular activities.

## [1] "Number of students for each nursery group:"
## 
##  No Yes 
## 128 521
## 
##        No       Yes 
## 0.1972265 0.8027735

Attending nursery school as a young child looks to slightly decrease the likelihood of drinking excessively when older.

## [1] "Number of students for each education group:"
## 
##  No Yes 
##  69 580
## 
##        No       Yes 
## 0.1063174 0.8936826

Students who are less inclined to attend higher education are more likely to drink excessive amounts of alcohol.

## [1] "Number of students that have internet at home:"
## 
##  No Yes 
## 151 498
## 
##        No       Yes 
## 0.2326656 0.7673344

There doesn’t seem to be much of a relationship between alcohol consumption and internet access at home. However, I am surprised by the number of students who do not have access at home (the data is from 2008).

## [1] "Number of students for each relationship group:"
## 
##  No Yes 
## 410 239
## 
##        No       Yes 
## 0.6317411 0.3682589

No strong relationship between having a significant other and alcohol consumption.

## [1] "Number of students for each quality group:"
## 
##  Very Bad       Bad   Average      Good Excellent 
##        22        29       101       317       180
## 
##   Very Bad        Bad    Average       Good  Excellent 
## 0.03389831 0.04468413 0.15562404 0.48844376 0.27734977
## df$Risk: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   4.000   3.972   5.000   5.000 
## -------------------------------------------------------- 
## df$Risk: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   4.000   3.792   5.000   5.000

Students in the high risk group typically have slightly worse family relationships.

## [1] "Number of students for each time group:"
## 
##  Very Low       Low   Average      High Very High 
##        45       107       251       178        68
## 
##   Very Low        Low    Average       High  Very High 
## 0.06933744 0.16486903 0.38674884 0.27426810 0.10477658
## df$Risk: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   3.000   3.132   4.000   5.000 
## -------------------------------------------------------- 
## df$Risk: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   3.000   3.342   4.000   5.000
## [1] "Correlation between Amount of free time after school and risk:"
## [1] 0.08420331

There is a weak, but positive relationship between amount of free time after school and alcohol consumption.

## [1] "Number of students for each social group:"
## 
##  Very Low       Low   Average      High Very High 
##        48       145       205       141       110
## 
##   Very Low        Low    Average       High  Very High 
## 0.07395994 0.22342065 0.31587057 0.21725732 0.16949153
## df$Risk: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   2.974   4.000   5.000 
## -------------------------------------------------------- 
## df$Risk: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   4.000   3.893   5.000   5.000
## [1] "Correlation between frequency of going out with friends and risk"
## [1] 0.328838

This could be the most differentiating feature that we have seen yet. We can clearly see that students who go out with their friends more often are more likely to be in the high risk group.

## [1] "Number of students for each health group:"
## 
##  Very Bad       Bad  Mediocre      Good Very Good 
##        90        78       124       108       249
## 
##  Very Bad       Bad  Mediocre      Good Very Good 
## 0.1386749 0.1201849 0.1910632 0.1664099 0.3836672
## df$Risk: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   4.000   3.456   5.000   5.000 
## -------------------------------------------------------- 
## df$Risk: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   4.000   3.805   5.000   5.000
## [1] "Correlation between health and risk:"
## [1] 0.1016733

We could be seeing the bias of a self-reported survey here. Despite having a very unhealthy habit, those in the high risk group rate their own health higher than those in the low risk group.

## df$Risk: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   2.000   3.264   4.000  32.000 
## -------------------------------------------------------- 
## df$Risk: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   4.000   4.987   8.000  22.000
## [1] "Correlation between number of absences and risk:"
## [1] 0.1562277

Students in the high risk group seem to miss more classes than those in the low risk group.

## [1] "Period 1"
## df$Risk: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   12.00   11.68   14.00   19.00 
## -------------------------------------------------------- 
## df$Risk: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    9.00   10.00   10.45   12.00   17.00
## [1] "Period 2"
## df$Risk: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   12.00   11.89   14.00   19.00 
## -------------------------------------------------------- 
## df$Risk: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   10.00   10.51   12.00   18.00
## [1] "Period 3"
## df$Risk: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   12.00   12.26   14.00   19.00 
## -------------------------------------------------------- 
## df$Risk: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   11.00   10.72   12.00   19.00

Students in the high risk group have worse grades on average in all three periods.

## [1] "Period 2 grades minus period 1 grades"
## df$Risk: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -9.000  -1.000   0.000   0.204   1.000  11.000 
## -------------------------------------------------------- 
## df$Risk: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -8.0000 -1.0000  0.0000  0.0604  1.0000  5.0000
## [1] "Correlation with risk:"
## [1] -0.04085654
## [1] "Period 3 grades minus period 2 grades"
## df$Risk: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -8.000   0.000   0.000   0.372   1.000   3.000 
## -------------------------------------------------------- 
## df$Risk: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -9.0000  0.0000  0.0000  0.2148  1.0000  6.0000
## [1] "Correlation with risk:"
## [1] -0.05177298
## [1] "Period 3 grades minus period 1 grades"
## df$Risk: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -9.000   0.000   1.000   0.576   2.000  11.000 
## -------------------------------------------------------- 
## df$Risk: High
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -11.0000   0.0000   1.0000   0.2752   1.0000   6.0000
## [1] "Correlation with risk:"
## [1] -0.06954099

The two groups are much closer here than they were in the previous sets of plots. It seems that students’ grades improve in a similar fashion, no matter what their drinking habits are.
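The grade changes compared above are simple differences between the period grades. A minimal sketch on made-up rows, assuming the period grades are in the `G1`, `G2` and `G3` columns from the data summary; the names of the difference columns are my own.

```r
# Made-up period grades and risk labels for illustration only
toy <- data.frame(
  G1 = c(10, 12, 9, 14, 8),
  G2 = c(11, 12, 10, 13, 8),
  G3 = c(12, 13, 10, 14, 9),
  Risk = factor(c("Low", "Low", "High", "Low", "High"), levels = c("Low", "High"))
)

# Change in grade between each pair of periods
toy$P2_minus_P1 <- toy$G2 - toy$G1
toy$P3_minus_P2 <- toy$G3 - toy$G2
toy$P3_minus_P1 <- toy$G3 - toy$G1

# Summary of the overall change within each risk group
print(by(toy$P3_minus_P1, toy$Risk, summary))

# Correlation of the overall change with risk (Low -> 1, High -> 2)
r <- cor(toy$P3_minus_P1, as.numeric(toy$Risk))
print(r)
```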

It is very clear here that students in the high risk group typically have below median grades.

Multivariate Analysis

Note: There are a number of combinations of features that I compared, but I will only present the plots that show a stronger relationship.

By looking at the top-left and bottom-right of the plot, we can see the pattern most clearly. Students who go out more often with their friends and study less are more likely to abuse alcohol than those who have the opposite habits.

The main grouping that I see here is in the top right, which represents males who go out more often with their friends (higher risk).

Building and Training the Models

## [1] "Perform recursive feature engineering."
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##          1   0.8385 0.5107    0.05064  0.1529        *
##          2   0.8346 0.4952    0.04982  0.1516         
##          3   0.8327 0.4782    0.05132  0.1697         
##          4   0.8346 0.4879    0.05302  0.1661         
##          5   0.8346 0.4911    0.05302  0.1659         
##          6   0.8365 0.5030    0.04818  0.1450         
##          7   0.8346 0.4947    0.04640  0.1408         
##          8   0.8308 0.4751    0.05270  0.1665         
##          9   0.8308 0.4717    0.05719  0.1799         
##         10   0.8231 0.4557    0.06000  0.1865         
##         56   0.8250 0.4314    0.03563  0.1184         
## 
## The top 1 variables (out of 1):
##    combos
## [1] "Features ranked by importance:"
##    combos   maleOut    simple       sex 
## 29.301525 16.143811 11.092364  2.376313

Train the random forest model.

## + Fold01: mtry=1 
## - Fold01: mtry=1 
## + Fold02: mtry=1 
## - Fold02: mtry=1 
## + Fold03: mtry=1 
## - Fold03: mtry=1 
## + Fold04: mtry=1 
## - Fold04: mtry=1 
## + Fold05: mtry=1 
## - Fold05: mtry=1 
## + Fold06: mtry=1 
## - Fold06: mtry=1 
## + Fold07: mtry=1 
## - Fold07: mtry=1 
## + Fold08: mtry=1 
## - Fold08: mtry=1 
## + Fold09: mtry=1 
## - Fold09: mtry=1 
## + Fold10: mtry=1 
## - Fold10: mtry=1 
## Aggregating results
## Fitting final model on full training set

Train the K-Nearest Neighbours model.

## + Fold01: k=10 
## - Fold01: k=10 
## + Fold02: k=10 
## - Fold02: k=10 
## + Fold03: k=10 
## - Fold03: k=10 
## + Fold04: k=10 
## - Fold04: k=10 
## + Fold05: k=10 
## - Fold05: k=10 
## + Fold06: k=10 
## - Fold06: k=10 
## + Fold07: k=10 
## - Fold07: k=10 
## + Fold08: k=10 
## - Fold08: k=10 
## + Fold09: k=10 
## - Fold09: k=10 
## + Fold10: k=10 
## - Fold10: k=10 
## Aggregating results
## Fitting final model on full training set

Train the Support Vector Machines model.

## + Fold01: sigma=1, C=1, Weight=1 
## - Fold01: sigma=1, C=1, Weight=1 
## + Fold02: sigma=1, C=1, Weight=1 
## - Fold02: sigma=1, C=1, Weight=1 
## + Fold03: sigma=1, C=1, Weight=1 
## - Fold03: sigma=1, C=1, Weight=1 
## + Fold04: sigma=1, C=1, Weight=1 
## - Fold04: sigma=1, C=1, Weight=1 
## + Fold05: sigma=1, C=1, Weight=1 
## - Fold05: sigma=1, C=1, Weight=1 
## + Fold06: sigma=1, C=1, Weight=1 
## - Fold06: sigma=1, C=1, Weight=1 
## + Fold07: sigma=1, C=1, Weight=1 
## - Fold07: sigma=1, C=1, Weight=1 
## + Fold08: sigma=1, C=1, Weight=1 
## - Fold08: sigma=1, C=1, Weight=1 
## + Fold09: sigma=1, C=1, Weight=1 
## - Fold09: sigma=1, C=1, Weight=1 
## + Fold10: sigma=1, C=1, Weight=1 
## - Fold10: sigma=1, C=1, Weight=1 
## Aggregating results
## Fitting final model on full training set

Train the extreme gradient boosting model.

## + Fold01: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1 
## - Fold01: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1 
## + Fold02: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1 
## - Fold02: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1 
## + Fold03: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1 
## - Fold03: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1 
## + Fold04: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1 
## - Fold04: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1 
## + Fold05: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1 
## - Fold05: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1 
## + Fold06: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1 
## - Fold06: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1 
## + Fold07: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1 
## - Fold07: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1 
## + Fold08: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1 
## - Fold08: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1 
## + Fold09: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1 
## - Fold09: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1 
## + Fold10: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1 
## - Fold10: nrounds=150, max_depth=3, eta=0.1, gamma=0.1, colsample_bytree=1, min_child_weight=0.8, subsample=1 
## Aggregating results
## Fitting final model on full training set
## [1] "Random Forest Model:"
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction  Low High
##       Low  70.2 10.6
##       High  6.7 12.5
##                             
##  Accuracy (average) : 0.8269

The accuracy is much higher than in the previous analysis, but nearly half of the high risk students are being labelled as low risk, which is not something that we wanted to happen.

## [1] "K-Nearest Neighbours Model:"
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction  Low High
##       Low  63.8  8.7
##       High 13.1 14.4
##                             
##  Accuracy (average) : 0.7827

The accuracy is a little lower than that of the Random Forest model, but the high risk students were predicted more accurately. I would call this a good trade-off.

## [1] "Support Vector Machines Model:"
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction  Low High
##       Low  71.9 11.7
##       High  5.0 11.3
##                             
##  Accuracy (average) : 0.8327

High risk students were predicted with slightly worse than 50% accuracy, but low risk students were predicted very accurately.

## [1] "Extreme Gradient Boosting Model:"
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction  Low High
##       Low  66.3  9.2
##       High 10.6 13.8
##                             
##  Accuracy (average) : 0.8019

This is rather similar to the K-Nearest Neighbours model. High Risk students were predicted with some accuracy, and the overall accuracy is close to 80%.
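Overall accuracy hides how well each model finds high risk students, so it helps to compute the share of truly high risk students that each model catches. This can be read off the ‘High’ reference column of the four cross-validated tables above. The sketch uses only the percentages already reported; the data frame and column names are illustrative.

```r
# 'High' reference column from each cross-validated confusion matrix above
cv <- data.frame(
  model = c("Random Forest", "K-Nearest Neighbours",
            "Support Vector Machines", "Extreme Gradient Boosting"),
  missed = c(10.6, 8.7, 11.7, 9.2),   # truly High, predicted Low
  caught = c(12.5, 14.4, 11.3, 13.8)  # truly High, predicted High
)

# Share of truly high risk students that each model labels as High
cv$high_recall <- cv$caught / (cv$missed + cv$caught)

# Rank the models from best to worst at finding high risk students
ranked <- cv[order(-cv$high_recall), c("model", "high_recall")]
print(ranked, row.names = FALSE)
```

On this measure, K-Nearest Neighbours and Extreme Gradient Boosting catch roughly 60% of high risk students, while Random Forest and Support Vector Machines catch about half.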

## [1] "Random Forest Model:"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low High
##       Low   80   17
##       High  20   12
##                                          
##                Accuracy : 0.7132         
##                  95% CI : (0.627, 0.7893)
##     No Information Rate : 0.7752         
##     P-Value [Acc > NIR] : 0.9605         
##                                          
##                   Kappa : 0.2062         
##  Mcnemar's Test P-Value : 0.7423         
##                                          
##             Sensitivity : 0.8000         
##             Specificity : 0.4138         
##          Pos Pred Value : 0.8247         
##          Neg Pred Value : 0.3750         
##              Prevalence : 0.7752         
##          Detection Rate : 0.6202         
##    Detection Prevalence : 0.7519         
##       Balanced Accuracy : 0.6069         
##                                          
##        'Positive' Class : Low            
## 
## [1] "K-Nearest Neighbour Model:"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low High
##       Low   72   13
##       High  28   16
##                                           
##                Accuracy : 0.6822          
##                  95% CI : (0.5944, 0.7613)
##     No Information Rate : 0.7752          
##     P-Value [Acc > NIR] : 0.99450         
##                                           
##                   Kappa : 0.2296          
##  Mcnemar's Test P-Value : 0.02878         
##                                           
##             Sensitivity : 0.7200          
##             Specificity : 0.5517          
##          Pos Pred Value : 0.8471          
##          Neg Pred Value : 0.3636          
##              Prevalence : 0.7752          
##          Detection Rate : 0.5581          
##    Detection Prevalence : 0.6589          
##       Balanced Accuracy : 0.6359          
##                                           
##        'Positive' Class : Low             
## 
## [1] "Support Vector Machines Model:"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low High
##       Low   87   18
##       High  13   11
##                                           
##                Accuracy : 0.7597          
##                  95% CI : (0.6766, 0.8305)
##     No Information Rate : 0.7752          
##     P-Value [Acc > NIR] : 0.7058          
##                                           
##                   Kappa : 0.2656          
##  Mcnemar's Test P-Value : 0.4725          
##                                           
##             Sensitivity : 0.8700          
##             Specificity : 0.3793          
##          Pos Pred Value : 0.8286          
##          Neg Pred Value : 0.4583          
##              Prevalence : 0.7752          
##          Detection Rate : 0.6744          
##    Detection Prevalence : 0.8140          
##       Balanced Accuracy : 0.6247          
##                                           
##        'Positive' Class : Low             
## 
## [1] "Extreme Gradient Boosting Model:"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low High
##       Low   66   16
##       High  34   13
##                                           
##                Accuracy : 0.6124          
##                  95% CI : (0.5227, 0.6969)
##     No Information Rate : 0.7752          
##     P-Value [Acc > NIR] : 0.99999         
##                                           
##                   Kappa : 0.0887          
##  Mcnemar's Test P-Value : 0.01621         
##                                           
##             Sensitivity : 0.6600          
##             Specificity : 0.4483          
##          Pos Pred Value : 0.8049          
##          Neg Pred Value : 0.2766          
##              Prevalence : 0.7752          
##          Detection Rate : 0.5116          
##    Detection Prevalence : 0.6357          
##       Balanced Accuracy : 0.5541          
##                                           
##        'Positive' Class : Low             
## 
## [1] "An Ensemble of all the Models:"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low High
##       Low   70   12
##       High  30   17
##                                           
##                Accuracy : 0.6744          
##                  95% CI : (0.5864, 0.7543)
##     No Information Rate : 0.7752          
##     P-Value [Acc > NIR] : 0.996898        
##                                           
##                   Kappa : 0.2345          
##  Mcnemar's Test P-Value : 0.008712        
##                                           
##             Sensitivity : 0.7000          
##             Specificity : 0.5862          
##          Pos Pred Value : 0.8537          
##          Neg Pred Value : 0.3617          
##              Prevalence : 0.7752          
##          Detection Rate : 0.5426          
##    Detection Prevalence : 0.6357          
##       Balanced Accuracy : 0.6431          
##                                           
##        'Positive' Class : Low             
## 
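
The five test-set confusion matrices above are easier to compare side by side when the key statistics are recomputed from the raw counts. The following is a minimal sketch in Python (standard library only); the `matrices` dictionary and the `summarise` function are illustrative names, not part of the original analysis. Because the reported 'Positive' class is Low, the "Specificity" row in each output is the share of truly High Risk students that the model caught, which is the number that matters most here.

```python
# Test-set confusion matrices copied from the output above.
# Tuple layout: (pred Low / ref Low, pred Low / ref High,
#                pred High / ref Low, pred High / ref High)
matrices = {
    "Random Forest": (80, 17, 20, 12),
    "K-Nearest Neighbour": (72, 13, 28, 16),
    "Support Vector Machines": (87, 18, 13, 11),
    "Extreme Gradient Boosting": (66, 16, 34, 13),
    "Ensemble": (70, 12, 30, 17),
}


def summarise(low_low, low_high, high_low, high_high):
    """Return (accuracy, Low recall, High recall, balanced accuracy)."""
    total = low_low + low_high + high_low + high_high
    accuracy = (low_low + high_high) / total
    # Positive class is Low, so this is "Sensitivity" in the output.
    low_recall = low_low / (low_low + high_low)
    # This is "Specificity" in the output: High Risk students correctly flagged.
    high_recall = high_high / (low_high + high_high)
    balanced = (low_recall + high_recall) / 2
    return accuracy, low_recall, high_recall, balanced


print(f"{'Model':<28}{'Acc':>8}{'Low':>8}{'High':>8}{'Bal':>8}")
for name, counts in matrices.items():
    acc, low, high, bal = summarise(*counts)
    print(f"{name:<28}{acc:>8.4f}{low:>8.4f}{high:>8.4f}{bal:>8.4f}")

# Model that identifies the largest share of High Risk students.
best_for_high = max(matrices, key=lambda m: summarise(*matrices[m])[2])
print("Best High Risk recall:", best_for_high)
```

Read this way, the ensemble catches the most High Risk students (17 of 29), while the Support Vector Machines model has the highest overall accuracy but flags only 11 of 29.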

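The initial hope that these models would be more useful than those in the previous analysis did not come to fruition. None of these models had a statistically significant P-value (Accuracy > No Information Rate), that is, one below 0.05; the smallest was 0.7058, for the Support Vector Machines model. In fact, every model's test accuracy (0.6124 to 0.7597) fell below the No Information Rate of 0.7752, so simply predicting Low for every student would have scored higher on accuracy alone.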
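The comparison against the No Information Rate can be reproduced from the confusion matrices. The sketch below assumes that the reported P-Value [Acc > NIR] is a one-sided exact binomial test of the number of correct predictions against the No Information Rate. That is my understanding of how this statistic is usually computed, and a rough check against the output above agrees with it. The function names `p_value_acc_gt_nir` and `min_correct_for_significance` are illustrative, and the code is Python using only the standard library.

```python
from math import comb


def p_value_acc_gt_nir(correct, total, nir):
    """One-sided exact binomial test: P(X >= correct), X ~ Binomial(total, nir)."""
    return sum(
        comb(total, k) * nir**k * (1 - nir) ** (total - k)
        for k in range(correct, total + 1)
    )


def min_correct_for_significance(total, nir, alpha=0.05):
    """Smallest number of correct predictions whose p-value is below alpha."""
    for correct in range(total + 1):
        if p_value_acc_gt_nir(correct, total, nir) < alpha:
            return correct
    return None


total = 129        # test-set size (sum of any confusion matrix above)
nir = 100 / total  # share of the majority class (Low): 0.7752

# Correct predictions = diagonal of each confusion matrix above.
correct_counts = {
    "Random Forest": 80 + 12,
    "K-Nearest Neighbour": 72 + 16,
    "Support Vector Machines": 87 + 11,
    "Extreme Gradient Boosting": 66 + 13,
    "Ensemble": 70 + 17,
}

for name, correct in correct_counts.items():
    p = p_value_acc_gt_nir(correct, total, nir)
    print(f"{name:<28} accuracy={correct / total:.4f}  p-value={p:.4f}")

needed = min_correct_for_significance(total, nir)
print("Correct predictions needed for p < 0.05:", needed, "of", total)
```

The last line shows how far the models are from significance on this test set: a model has to beat the majority-class baseline by a clear margin, not just match it, before the p-value drops below 0.05.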

I am not entirely sure what is missing for these models to become useful and statistically significant. It is easy to say that more data would have helped, but what else? New features could have been useful, such as “has older sibling” (someone to buy them alcohol), “frequency of parents’ consumption of alcohol” (whether drinking is something the students see regularly and are influenced by), or “has their own car” (it is easier to get up to mischief with the freedom of mobility). To reiterate what was said at the start of this analysis, there might be some bias in the data. Since the data is about students taking a Portuguese language course, rather than a mandatory class such as English or Math, we are limited to building models from students who share one particular interest, rather than observing the full range of students.